The analysis relies on the SportsStats dataset to study country-level patterns in medal earning.
The goal is to understand which countries are currently performing best and which are on the rise.
Additional information includes an analysis of whether different NOCs rely on single high-talent individuals competing in multiple editions or events, and of which parameters correlate best with winning medals.
The results will be used in a sports TV broadcast airing before the next Olympics, telling the story of the different performance records and of what to expect in upcoming editions.
SportsStats is a sports analysis firm partnering with local news and elite personal trainers to provide “interesting” insights to help their partners. Insights could be patterns/trends highlighting certain groups/events/countries, etc. for the purpose of developing a news story or discovering key health insights.
| Height | Female | Male |
|---|---|---|
| count | 64402.000000 | 125277.000000 |
| mean | 168.012422 | 179.276084 |
| std | 8.816888 | 9.451385 |
| min | 127.000000 | 127.000000 |
| 25% | 163.000000 | 173.000000 |
| 50% | 168.000000 | 180.000000 |
| 75% | 174.000000 | 185.000000 |
| max | 213.000000 | 226.000000 |
(Similar considerations are valid for weight measurements, which are not the focus of this report)
The framework section describes the work done to add the features needed for the desired analysis.
The framework section includes:
The results for the table are shown in the cell below (first 5 rows).
| | Country Name | Country Code | Year | Population |
|---|---|---|---|---|
| 2 | Afghanistan | AFG | 1960 | 8996967 |
| 271 | Afghanistan | AFG | 1961 | 9169406 |
| 540 | Afghanistan | AFG | 1962 | 9351442 |
| 809 | Afghanistan | AFG | 1963 | 9543200 |
| 1078 | Afghanistan | AFG | 1964 | 9744772 |
The results for the table are shown in the cell below (first 5 rows).
| | NOC | region | alpha_3 | iso_names |
|---|---|---|---|---|
| 0 | AFG | Afghanistan | AFG | Afghanistan |
| 1 | AHO | Curacao | CUW | Curaçao |
| 2 | ALB | Albania | ALB | Albania |
| 3 | ALG | Algeria | DZA | Algeria |
| 4 | AND | Andorra | AND | Andorra |
As preliminary work, the data is grouped by height and sex and sorted by ascending height. While grouping, the medals of each type (and the no-medal outcomes) are counted; the first rows are shown in the cell below.
Let's take a look at how populated the different height classes are.
| | Height | Sex | Gold | Silver | Bronze | NoMedal | TotalPartecipants |
|---|---|---|---|---|---|---|---|
| 0 | 127 | M | 0 | 0 | 0 | 1 | 1 |
| 1 | 127 | F | 0 | 0 | 0 | 6 | 6 |
| 2 | 128 | M | 0 | 0 | 0 | 1 | 1 |
| 3 | 130 | M | 0 | 0 | 0 | 2 | 2 |
| 4 | 131 | F | 0 | 0 | 0 | 2 | 2 |
| | Height | TotalPartecipants |
|---|---|---|
| count | 169.000000 | 169.000000 |
| mean | 173.538462 | 1122.360947 |
| std | 25.634333 | 1684.239232 |
| min | 127.000000 | 1.000000 |
| 25% | 152.000000 | 16.000000 |
| 50% | 173.000000 | 231.000000 |
| 75% | 194.000000 | 1783.000000 |
| max | 226.000000 | 9455.000000 |
On average, each class (a distinct height value, for example 175 cm or 137 cm) contains about 1,122 athletes. However, some classes contain only one athlete. Because I want to run a correlation analysis between height and success frequency, I do not want sparsely populated classes (outliers) in the analysis. For example, the male 127 cm class contains a single athlete: whether he won a medal or not is likely irrelevant. As a filtering strategy, I excluded height classes with fewer than 5.2 athletes, which is the 10% quantile of the class populations.
I use the MedalsPerPartecipant metric to evaluate the success in the different height classes. For example, if there are 20 athletes at 210cm, and 15 of them were awarded a medal, the MedalsPerPartecipant in the class would be 0.75.
NOTE: The metric could likely be further refined to account for highly (and lowly) populated classes, but I will leave it for future investigation.
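The filtering and the MedalsPerPartecipant metric described above can be sketched as follows. This is a minimal illustration with pandas, not the notebook's actual code: the column names follow the tables in this report, while the data values and the variable names (classes, populated, threshold) are assumptions made up for the example.

```python
import pandas as pd

# Toy per-class table. Column names follow the tables in this report;
# the values are invented for illustration only.
classes = pd.DataFrame({
    "Height": [127, 150, 175, 190, 210],
    "Sex": ["M", "F", "M", "M", "M"],
    "Gold": [0, 2, 40, 30, 5],
    "Silver": [0, 3, 45, 25, 5],
    "Bronze": [0, 1, 50, 20, 5],
    "NoMedal": [1, 44, 865, 325, 5],
})
medal_cols = ["Gold", "Silver", "Bronze"]

# Class population: every athlete entry, with or without a medal.
classes["TotalPartecipants"] = classes[medal_cols + ["NoMedal"]].sum(axis=1)

# Drop sparsely populated classes: keep those at or above the 10% quantile.
threshold = classes["TotalPartecipants"].quantile(0.10)
populated = classes[classes["TotalPartecipants"] >= threshold].copy()

# Share of entries in each class that ended with a medal.
populated["MedalsPerPartecipant"] = (
    populated[medal_cols].sum(axis=1) / populated["TotalPartecipants"]
)
print(populated[["Height", "Sex", "TotalPartecipants", "MedalsPerPartecipant"]])
```

In this toy data the single-athlete 127 cm class falls below the quantile threshold and is dropped, while the 210 cm class (15 medals out of 20 athletes) gets a MedalsPerPartecipant of 0.75, as in the example above.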
PopulationPerMedal_thousands: for each country and Olympic year, how many people (in thousands) there were per medal won. For example, a country that in 2012 had a population of 1,000,000 and won 1 Olympic medal would have a PopulationPerMedal_thousands of 1,000.
EventPartecipationPerMedal: for each country and Olympic year, how many event participations were needed to win a medal. For example, a country that in 2016 participated in 300 events and won 10 medals would have an EventPartecipationPerMedal of 30.
AthletePerMedal: similar to EventPartecipationPerMedal, but based on the number of distinct athletes competing for the country.
AthletePerEventPartecipation: evaluates which NOCs rely more on single athletes performing in many events. A ratio of 1 means a different athlete competes in every event; a ratio approaching 0 means the same athlete competes in all events.
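The four per-country metrics can be computed with simple column arithmetic. The sketch below assumes a per-NOC, per-year summary table whose column names match the descriptive statistics in this report; the two countries and their values are hypothetical.

```python
import pandas as pd

# Hypothetical per-NOC, per-year summary; values are illustrative only.
summary = pd.DataFrame({
    "NOC": ["AAA", "BBB", "CCC"],
    "Year": [2012, 2016, 2016],
    "Population": [1_000_000, 50_000_000, 3_000_000],
    "TotalMedals": [1, 10, 0],
    "events_partecipations_no": [12, 300, 8],
    "distinct_athletes_no": [12, 150, 8],
})

# Per-medal metrics are undefined without medals, so those rows are dropped.
summary = summary[summary["TotalMedals"] > 0].copy()

summary["PopulationPerMedal_thousands"] = (
    summary["Population"] / summary["TotalMedals"] / 1000
)
summary["EventPartecipationPerMedal"] = (
    summary["events_partecipations_no"] / summary["TotalMedals"]
)
summary["AthletePerMedal"] = summary["distinct_athletes_no"] / summary["TotalMedals"]

# 1.0 means a different athlete in every event; lower values mean
# the same athletes are entered in several events.
summary["AthletePerEventPartecipation"] = (
    summary["distinct_athletes_no"] / summary["events_partecipations_no"]
)
print(summary.set_index("NOC").iloc[:, -4:])
```

Country AAA reproduces the first example above (1,000 thousand people per medal) and country BBB the second (30 event participations per medal).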
The descriptive stats are shown in the cell below. Some considerations follow.
| | Year | events_partecipations_no | distinct_athletes_no | Bronze | Gold | NoMedal | Silver | TotalPartecipants | TotalMedals | Population | PopulationPerMedal_thousands | EventPartecipationPerMedal | AthletePerMedal | AthletePerEventPartecipation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 861.000000 | 861.000000 | 861.000000 | 861.000000 | 861.000000 | 861.000000 | 861.000000 | 861.000000 | 861.000000 | 8.610000e+02 | 8.610000e+02 | 861.000000 | 861.000000 | 861.000000 |
| mean | 1994.466899 | 162.288037 | 119.869919 | 4.441347 | 3.996516 | 66.403020 | 3.982578 | 78.823461 | 12.420441 | 5.759738e+07 | 1.661976e+04 | 25.607903 | 19.748040 | 0.785246 |
| std | 15.737095 | 164.983912 | 119.404391 | 6.834275 | 8.565275 | 50.906791 | 7.022451 | 67.263930 | 21.663220 | 1.654930e+08 | 7.955359e+04 | 24.845350 | 18.492840 | 0.119667 |
| min | 1964.000000 | 3.000000 | 3.000000 | 0.000000 | 0.000000 | 2.000000 | 0.000000 | 3.000000 | 1.000000 | 5.320000e+04 | 5.024067e+01 | 2.800000 | 2.384615 | 0.355263 |
| 25% | 1984.000000 | 43.000000 | 36.000000 | 1.000000 | 0.000000 | 27.000000 | 0.000000 | 30.000000 | 2.000000 | 5.591572e+06 | 1.012278e+03 | 11.560000 | 8.684211 | 0.713287 |
| 50% | 1996.000000 | 95.000000 | 74.000000 | 2.000000 | 1.000000 | 52.000000 | 1.000000 | 57.000000 | 4.000000 | 1.476009e+07 | 2.463843e+03 | 18.000000 | 14.142857 | 0.785571 |
| 75% | 2008.000000 | 232.000000 | 163.000000 | 5.000000 | 4.000000 | 94.000000 | 4.000000 | 107.000000 | 13.000000 | 4.923058e+07 | 9.222692e+03 | 31.000000 | 24.000000 | 0.870370 |
| max | 2016.000000 | 839.000000 | 648.000000 | 46.000000 | 82.000000 | 234.000000 | 69.000000 | 322.000000 | 195.000000 | 1.378665e+09 | 1.129623e+06 | 281.000000 | 174.000000 | 1.000000 |
As these metrics are on a per medal basis, countries with no medals awarded are excluded from this section.
I want to define position-, velocity-, and acceleration-like features to describe, respectively, how countries are faring now, what their performance trend is, and how quickly that trend is changing.
The position feature is given by the number of total medals earned in a given year. For example, in 2016 Italy was awarded 28 total medals.
The velocity feature is given by the derivative of the number of total medals with respect to time, evaluated as a discrete difference between consecutive Olympic editions.
In principle:
$$v = \frac{\text{Medals}_{2016}-\text{Medals}_{2012}}{4\;\text{years}}$$

However, that is a first-order approximation. I used the gradient function provided by the numpy library (https://numpy.org/doc/stable/reference/generated/numpy.gradient.html) for a second-order approximation of the (centered) derivative, with a first-order approximation at the endpoints.
Similarly, I evaluated the acceleration as the derivative of velocity.
I am not particularly interested in the absolute values of velocity and acceleration, only in how they compare with those of other countries.
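As a minimal sketch of the velocity and acceleration features, the snippet below applies numpy.gradient to one country's medal series. The 2012 and 2016 totals are the United States values reported in this document; the 2004 and 2008 totals are placeholders chosen for illustration, and the variable names are assumptions.

```python
import numpy as np

# Olympic years and total medals for one country.
# 2012 and 2016 come from the tables in this report;
# 2004 and 2008 are illustrative placeholders.
years = np.array([2004, 2008, 2012, 2016], dtype=float)
medals = np.array([101, 110, 103, 121], dtype=float)

# Centered (second-order) differences in the interior,
# one-sided (first-order) differences at the two endpoints.
velocity = np.gradient(medals, years)

# Acceleration is the derivative of velocity, computed the same way.
acceleration = np.gradient(velocity, years)

# At the 2016 endpoint the velocity reduces to (121 - 103) / 4 = 4.5 medals/year.
print(velocity)
print(acceleration)
```

Passing the years as the second argument makes the result a per-year rate rather than a per-edition difference, which matters if a country skips an edition and the spacing is no longer 4 years.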
I want to create a framework to analyze the success of different countries based on the total number of medals awarded.
Test strategy: group the data frame by NOC, Sport, Event, Year, Season, Games, Team, and Medal. The reason is that athletes from the same NOC can share the podium in the same competition, so those instances need to be kept separate for a reliable count. For example, if one NOC wins both gold and bronze in the same event, keeping Medal among the grouping keys counts them as two separate medals.
Once this data is collected, I group by NOC and count the total number of medals awarded. I then join this table with the original data frame, from which I can count the total number of athletes representing each NOC.
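The counting strategy above can be sketched on a few toy rows. The column names are the grouping keys listed in the text plus an assumed athlete identifier column (ID); the athletes, events, and medals are invented for illustration and are not real results.

```python
import pandas as pd

cols = ["ID", "NOC", "Sport", "Event", "Year", "Season", "Games", "Team", "Medal"]
# One row per athlete per event (toy data, not real results).
rows = pd.DataFrame(
    [
        # Two team-mates sharing one team gold: must count as ONE medal.
        (1, "ITA", "Fencing", "Team Foil", 2016, "Summer", "2016 Summer", "Italy", "Gold"),
        (2, "ITA", "Fencing", "Team Foil", 2016, "Summer", "2016 Summer", "Italy", "Gold"),
        # Same NOC twice on the podium of one event: TWO medals.
        (3, "ITA", "Judo", "Lightweight", 2016, "Summer", "2016 Summer", "Italy", "Gold"),
        (4, "ITA", "Judo", "Lightweight", 2016, "Summer", "2016 Summer", "Italy", "Bronze"),
        # Participation without a medal.
        (5, "ITA", "Swimming", "100m Freestyle", 2016, "Summer", "2016 Summer", "Italy", None),
    ],
    columns=cols,
)

keys = ["NOC", "Sport", "Event", "Year", "Season", "Games", "Team", "Medal"]

# Collapse team-mates into a single medal per (event, team, medal type).
medal_events = (
    rows.dropna(subset=["Medal"])
    .groupby(keys)
    .size()
    .reset_index(name="athletes_on_podium")
)

# Medals per NOC, joined with the number of distinct athletes per NOC.
total_medals = medal_events.groupby("NOC").size().rename("TotalMedals").reset_index()
athletes = rows.groupby("NOC")["ID"].nunique().rename("distinct_athletes_no").reset_index()
per_noc = athletes.merge(total_medals, on="NOC", how="left")
per_noc["TotalMedals"] = per_noc["TotalMedals"].fillna(0).astype(int)
print(per_noc)
```

The left join keeps NOCs that sent athletes but won nothing, with TotalMedals set to 0.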
To validate my results, I compare the output of my query with the medal records from Wikipedia for two Summer editions, 2016 and 2012 (shown in that order).
| | iso_names | TotalMedals |
|---|---|---|
| 0 | United States | 121 |
| 1 | China | 70 |
| 2 | United Kingdom | 67 |
| 3 | Russian Federation | 56 |
| 4 | Germany | 42 |
| 5 | France | 42 |
| 6 | Japan | 41 |
| 7 | Australia | 29 |
| 8 | Italy | 28 |
| 9 | Canada | 22 |
Total no. of medals awarded: 973
| | iso_names | TotalMedals |
|---|---|---|
| 0 | United States | 103 |
| 1 | China | 88 |
| 2 | Russian Federation | 82 |
| 3 | United Kingdom | 65 |
| 4 | Germany | 44 |
| 5 | Japan | 38 |
| 6 | Australia | 35 |
| 7 | France | 35 |
| 8 | Italy | 28 |
| 9 | Korea, Republic of | 28 |
Total no. of medals awarded: 962
The dataset in use does not account for medal revocations or reassignments. As such, it might not accurately represent the final ranking.
For instance, Russia was later stripped of 14 medals from the 2012 edition, bringing its total from 82 to 68. That is a large discrepancy. I highlight this in my results: the rankings provided are the “day after” rankings and do not account for later reassignments due to doping or other scandals.
From the top 5 countries in 2016 we can see that it is not always the largest countries that win more medals, but the ones with more participating athletes. This is expected, as top teams that pass the qualification tournaments are granted more event entries (and therefore more athletes).
From the map, we can see that the host country performs much better than usual, helped by home support and by the larger number of athletes who qualify by virtue of hosting. For example, China in 2008 and the United Kingdom in 2012 performed much better than in other editions.
Please note that the terrific performance of Russia in 2012 has largely been invalidated due to doping violations (14 medals were removed).
Top 5 and bottom 5 countries in 2016 (based on total medals):
| | NOC | iso_names | Year | TotalMedals | TotalPartecipants | population_display |
|---|---|---|---|---|---|---|
| 0 | USA | United States | 2016 | 121 | 321 | 323,071,755 |
| 1 | CHN | China | 2016 | 70 | 239 | 1,378,665,000 |
| 2 | GBR | United Kingdom | 2016 | 67 | 224 | 65,611,593 |
| 3 | RUS | Russian Federation | 2016 | 56 | 202 | 144,342,397 |
| 4 | FRA | France | 2016 | 42 | 212 | 66,724,104 |
| ... | ... | ... | ... | ... | ... | ... |
| 199 | IVB | Virgin Islands, British | 2016 | 0 | 4 | 29,355 |
| 200 | KGZ | Kyrgyzstan | 2016 | 0 | 17 | 6,079,500 |
| 201 | KIR | Kiribati | 2016 | 0 | 3 | 112,529 |
| 202 | ALB | Albania | 2016 | 0 | 6 | 2,876,101 |
| 203 | ZIM | Zimbabwe | 2016 | 0 | 13 | 14,030,338 |
204 rows × 6 columns
A possible interpretation of the velocity field is:
Velocity is defined as the number of medals added per year from one Olympic edition to the next.
For example, the USA velocity in 2016 is 4.5, as the team added 18 medals from the 2012 edition to the 2016 edition (18 medals over 4 years). Two of the top 5 teams in this ranking were also in the top medal ranking shown above and, quite remarkably, Team USA is number one in both. Furthermore, the velocity of Team USA (4.5) is almost double that of second-ranked Uzbekistan (2.5).
Smaller countries are more visible in this ranking.
Countries that performed badly in 2012 and well in 2016 have a more positive velocity. This effect is more balanced in years before 2016: 2016 is the endpoint of our range, so its derivative is only a first-order approximation, while interior years use a second-order centered approximation, which is often more accurate.
Conversely, countries with a good performance in earlier years (perhaps due to hosting) and a worse recent performance are at the bottom of this ranking. Countries that earned a large number of medals (such as China and Russia) are also the ones that risk "losing" more.
| | NOC | iso_names | Year | velocity |
|---|---|---|---|---|
| 0 | USA | United States | 2016 | 4.50 |
| 1 | UZB | Uzbekistan | 2016 | 2.50 |
| 2 | AZE | Azerbaijan | 2016 | 2.00 |
| 3 | FRA | France | 2016 | 1.75 |
| 4 | DEN | Denmark | 2016 | 1.50 |
| ... | ... | ... | ... | ... |
| 199 | AUS | Australia | 2016 | -1.50 |
| 200 | KOR | Korea, Republic of | 2016 | -1.75 |
| 201 | UKR | Ukraine | 2016 | -2.25 |
| 202 | CHN | China | 2016 | -4.50 |
| 203 | RUS | Russian Federation | 2016 | -6.50 |
204 rows × 4 columns
Acceleration cannot be explained as easily, but it tells how quickly a country's current trend is changing (how fast its velocity changes).
Once again, 2 of the teams in this top 5 were already in the top 5 medal ranking, and 3 were in the top 5 velocity ranking. There are only 2 new entries in this ranking.
Once again Team USA is in first place, indicating quite a remarkable moment for sport in the US. The UK (host country in 2012) has a declining record, as is expected in the years after hosting. Russia has the worst record: it follows a great performance in 2012, and the country is dealing with the aftermath of the doping scandals (which partially altered its 2012 record).
| | NOC | iso_names | Year | acceleration |
|---|---|---|---|---|
| 0 | USA | United States | 2016 | 0.78125 |
| 1 | FRA | France | 2016 | 0.40625 |
| 2 | UZB | Uzbekistan | 2016 | 0.40625 |
| 3 | CUB | Cuba | 2016 | 0.21875 |
| 4 | SUI | Switzerland | 2016 | 0.18750 |
| ... | ... | ... | ... | ... |
| 199 | JPN | Japan | 2016 | -0.31250 |
| 200 | HUN | Hungary | 2016 | -0.34375 |
| 201 | IRI | Iran, Islamic Republic of | 2016 | -0.43750 |
| 202 | GBR | United Kingdom | 2016 | -0.46875 |
| 203 | RUS | Russian Federation | 2016 | -1.12500 |
204 rows × 4 columns
Smaller countries generally have a higher athlete/event ratio, contrary to my initial belief that smaller countries would bring fewer athletes to the Olympics (and have them compete in multiple events) for budget reasons. The opposite is true: larger delegations (like Russia, China, and USA) compete in many more events with the same athletes.
Many of the large countries consistently rely on single athletes competing in multiple events. I initially considered this a grim indicator; however, teams like USA are performing well, with a positive trend and rapidly increasing dominance. It is therefore unlikely that this metric is a useful indicator.
| | NOC | iso_names | AthletePerEventPartecipation |
|---|---|---|---|
| 0 | JOR | Jordan | 1.000000 |
| 1 | GRN | Grenada | 1.000000 |
| 2 | UAE | United Arab Emirates | 1.000000 |
| 3 | KOS | None | 1.000000 |
| 4 | TJK | Tajikistan | 1.000000 |
| ... | ... | ... | ... |
| 80 | JAM | Jamaica | 0.727273 |
| 81 | NED | Netherlands | 0.720365 |
| 82 | RUS | Russian Federation | 0.699507 |
| 83 | SUI | Switzerland | 0.671053 |
| 84 | TTO | Trinidad and Tobago | 0.651163 |
85 rows × 3 columns
The entire map is washed out by the negative record of India: despite its large population, the country is still largely unsuccessful at the Summer Olympics. Excluding India from this visualization could help highlight the differences between other countries, as India can be considered an outlier with one medal per roughly 660M people. Smaller countries dominate the bottom of the ranking, with Grenada at one medal per 110k people. Although this was expected, it is no less impressive.
| | NOC | iso_names | PopulationPerMedal_thousands |
|---|---|---|---|
| 0 | IND | India | 662258.625 |
| 1 | NGR | Nigeria | 185960.244 |
| 2 | PHI | Philippines | 103663.812 |
| 3 | INA | Indonesia | 87185.462 |
| 4 | VIE | Viet Nam | 46820.2175 |
| ... | ... | ... | ... |
| 80 | DEN | Denmark | 381.867333 |
| 81 | JAM | Jamaica | 264.203818 |
| 82 | NZL | New Zealand | 261.894444 |
| 83 | BAH | Bahamas | 188.9615 |
| 84 | GRN | Grenada | 110.263 |
85 rows × 3 columns
Portugal was the worst performer in this category in 2016, with only one medal earned by 90 athletes.
Once again, larger delegations like USA and Russia present an outstanding record of about one medal for every 5 athletes competing. On average, this means that about 20% of their athletes come back from the Games with a medal.
| | NOC | iso_names | AthletePerMedal |
|---|---|---|---|
| 0 | POR | Portugal | 90.000000 |
| 1 | NGR | Nigeria | 71.000000 |
| 2 | AUT | Austria | 71.000000 |
| 3 | IND | India | 56.000000 |
| 4 | FIN | Finland | 54.000000 |
| ... | ... | ... | ... |
| 80 | RUS | Russian Federation | 5.071429 |
| 81 | ETH | Ethiopia | 4.625000 |
| 82 | USA | United States | 4.586777 |
| 83 | PRK | Korea, Democratic People's Republic of | 4.428571 |
| 84 | AZE | Azerbaijan | 3.111111 |
85 rows × 3 columns
Comparisons with similar-sized countries can only be carried out for average or small countries. In fact, the highest-ranked countries like USA, Russia, and China are comparable in performance but not in size. The analysis considers countries whose size is within $\pm$ 20% of the reference country.
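A possible way to select similar-sized countries is sketched below. It assumes that size means population and that the 20% band is measured relative to the reference country; the helper name similar_sized is hypothetical. The populations are the 2016 values from the top-5 table in this report.

```python
import pandas as pd

# 2016 populations taken from the top-5 table in this report.
population = pd.Series({
    "USA": 323_071_755,
    "CHN": 1_378_665_000,
    "GBR": 65_611_593,
    "RUS": 144_342_397,
    "FRA": 66_724_104,
})

def similar_sized(noc, population, tolerance=0.20):
    """Return the NOCs whose population is within +/- tolerance of `noc`.

    Hypothetical helper: the band is relative to the reference country.
    """
    reference = population[noc]
    in_band = population.between(
        reference * (1 - tolerance), reference * (1 + tolerance)
    )
    # Exclude the reference country itself from its own peer group.
    return population[in_band].drop(noc).index.tolist()

print(similar_sized("GBR", population))
print(similar_sized("USA", population))
```

Within these five countries, only France falls in the United Kingdom's band, and none falls in the band of the United States, which illustrates why the largest countries have few or no peers.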
The performance of Team Italy has been consistently subpar in recent years with respect to the UK, France, and South Korea.
The United States is comparable in size only with Indonesia, which has a much worse record that is hardly comparable to that of the USA.
Number of athletes competing and medals earned have a Pearson correlation coefficient r of 0.81.
This means that the number of athletes is strongly correlated to the number of medals that they will earn.
The p-value is 9.2e-16; as p-value <0.001, there is strong confidence in the result.
The result can be expected, as big delegations like USA and China generally take home many medals. However, this offers an interesting insight about the hosting country. Competing in the home country always gives the advantage of competing in front of many home supporters in familiar venues, but it also offers a granted competing spot in every event. As the number of athletes and medals are correlated, is hosting a "sure" way to get more medals?
NOTE: The correlation has a very similar value for the number of events attended.
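For reference, a Pearson coefficient and its p-value can be obtained with scipy.stats.pearsonr; whether the notebook used this exact function is an assumption. The sketch uses only the five rows of the 2016 top-5 table above, so it will not reproduce the r of 0.81 computed on the full sample.

```python
from scipy.stats import pearsonr

# TotalPartecipants and TotalMedals for the 2016 top 5 (USA, CHN, GBR, RUS, FRA),
# copied from the table in this report. Five points are far too few for a
# meaningful test; they only illustrate the call.
participants = [321, 239, 224, 202, 212]
medals = [121, 70, 67, 56, 42]

# r is the Pearson correlation coefficient; p_value is the two-sided
# probability of seeing an |r| at least this large with no real correlation.
r, p_value = pearsonr(participants, medals)
print(f"r = {r:.2f}, p-value = {p_value:.3g}")
```

The same call applies to the height analysis by passing the height classes and their MedalsPerPartecipant values as the two sequences.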
Women's height and number of medals awarded (on a per event participation basis) have a Pearson correlation coefficient r of 0.78. This means that height is positively correlated with the number of medals earned. I was expecting a correlation, but not one this strong, especially considering that in sports like gymnastics height can be a disadvantage. The p-value is 9.2e-16 (p-value <0.001), indicating strong confidence in the result.
The results also hold for men (r=0.76, p-value=2.1e-15).
Heights with fewer than 5.2 data points are filtered out.
The United States and France are not only among the most awarded teams in recent editions (position-like feature), but they are also on a positive streak in the last editions (velocity-like feature), and they are quickly increasing their dominance (acceleration-like feature).